The dataset I’ve decided to explore in this project is the Red Wine dataset. It contains information on red wine characteristics such as acidity, residual sugar, pH, and alcohol percentage.
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 46 46 4.6 0.52 0.15 2.1
## 96 96 4.7 0.60 0.17 2.3
## 132 132 5.6 0.50 0.09 2.3
## 133 133 5.6 0.50 0.09 2.3
## 143 143 5.2 0.34 0.00 1.8
## 145 145 5.2 0.34 0.00 1.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 46 0.054 8 65 0.9934 3.90
## 96 0.058 17 106 0.9932 3.85
## 132 0.049 17 99 0.9937 3.63
## 133 0.049 17 99 0.9937 3.63
## 143 0.050 27 63 0.9916 3.68
## 145 0.050 27 63 0.9916 3.68
## sulphates alcohol quality
## 46 0.56 13.1 4
## 96 0.60 12.9 6
## 132 0.63 13.0 5
## 133 0.63 13.0 5
## 143 0.79 14.0 6
## 145 0.79 14.0 6
This first table is a quick view of wines that have an alcohol content greater than 11%.
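A filter of this kind can be sketched in base R as follows. The code that produced the table is not shown in this report, so this is only one plausible way to do it; the data frame `toy_wine` and its values are made up for illustration, and only the 11% threshold comes from the description above.

```r
# Toy stand-in for the red wine data (illustrative values, not real rows)
toy_wine <- data.frame(
  alcohol = c(9.4, 13.1, 10.5, 12.9, 11.0, 14.0),
  pH      = c(3.51, 3.90, 3.20, 3.85, 3.30, 3.68),
  quality = c(5, 4, 6, 6, 5, 6)
)

# Keep only rows whose alcohol content is strictly greater than 11%
strong_wines <- subset(toy_wine, alcohol > 11)

# Show the first few matching rows; the original row names are preserved,
# which is why the table above starts at row 46 rather than row 1
head(strong_wines)
```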
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This summary shows some introductory information about each column and the range of values they carry.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
This chart explores the free sulfur dioxide content across the dataset. Most wines have a free sulfur dioxide content below about 20: the summary above puts the median at 14 and the third quartile at 21.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
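This chart explores the total sulfur dioxide content across the dataset. The distribution is right-skewed: the summary above puts the median at 38 and the third quartile at 62, so more than half (but fewer than three quarters) of the wines fall below 50, with a long tail reaching 289.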
## Warning: Removed 1 rows containing missing values (geom_bar).
This plot shows the number of wines by pH level, and it appears most wines are between a pH of 3 and 3.5.
## Warning: Removed 71 rows containing non-finite values (stat_bin).
This plot shows the count of wines by density over the range 0.99 to 1.00, using bins of width 0.001.
## Warning: Removed 21 rows containing non-finite values (stat_bin).
This chart explores volatile acidity in the dataset. It appears almost like a normal distribution!
This plot shows the number of wines at each level of fixed acidity. The most common values fall between 7 and 8, which is consistent with the median of 7.9 in the summary above.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
This shows the number of wines at each level of alcohol content. The distribution peaks between 9% and 10%, but it is right-skewed: the summary above puts the median at 10.2%, and a quarter of the wines are above 11.1%.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram shows the number of wines in each quality rating. We can see that the majority of wines are rated 5 or 6. I wonder what it takes to receive a rating of 8?
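The `stat_bin()` message above appears because no bin width was chosen. Since quality is an integer rating, one option is a bin width of 1, sketched below with ggplot2. The data frame `toy_wine` is made up for illustration, and the axis labels are my own choice rather than the ones used in the original plot.

```r
library(ggplot2)

# Toy stand-in for the quality column (illustrative values only)
toy_wine <- data.frame(quality = c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8))

# binwidth = 1 gives one bar per integer rating and silences the
# "Pick better value with `binwidth`" message
p <- ggplot(toy_wine, aes(x = quality)) +
  geom_histogram(binwidth = 1, color = "white") +
  scale_x_continuous(breaks = 3:8) +
  labs(x = "Quality rating", y = "Number of wines")

print(p)
```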
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1430 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
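Among the wines shown in the plot above, most have a chloride level between 0.1 and 0.12. However, the warning reports that 1430 of the 1599 rows were removed, most likely because they fell outside the axis limits, so the plot covers only the upper tail. The summary table puts the median at 0.079 and the middle half of the wines between 0.07 and 0.09.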
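The "Removed N rows containing non-finite values" warnings in this section are typically what ggplot2 prints when observations fall outside the axis limits set with `xlim()` or a scale's `limits`. A quick way to see how much data a pair of limits would drop is to count the out-of-range values first. The vector `toy_chlorides` and the limits in `x_limits` below are hypothetical; the limits used for the original plots are not shown in this report.

```r
# Toy chloride values (illustrative, not taken from the dataset)
toy_chlorides <- c(0.05, 0.07, 0.079, 0.09, 0.11, 0.12, 0.61)

# Hypothetical axis limits for a histogram
x_limits <- c(0.1, 0.2)

# Flag every value that lies outside the limits
outside <- toy_chlorides < x_limits[1] | toy_chlorides > x_limits[2]

# Number of rows a plot with these limits would drop
n_removed <- sum(outside)
n_removed

# Share of the data that would remain visible
share_kept <- mean(!outside)
share_kept
```

If the goal is only to zoom in, `coord_cartesian()` changes the visible range without discarding rows before the bins are computed.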
## Warning: Removed 8 rows containing non-finite values (stat_bin).
It appears that most of the wine has a sulphate level between 0.4 and 0.8.
## redWine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## redWine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## redWine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## redWine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## redWine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## redWine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
This is a summary of residual sugar content within each quality rating. The median (2.1 to 2.3) and the mean (about 2.5 to 2.7) are nearly the same across ratings, so sugar does not appear to have a direct effect on the rating.
## redWine$quality: 3
## [1] 10532
## --------------------------------------------------------
## redWine$quality: 4
## [1] 42240
## --------------------------------------------------------
## redWine$quality: 5
## [1] 505290
## --------------------------------------------------------
## redWine$quality: 6
## [1] 540656
## --------------------------------------------------------
## redWine$quality: 7
## [1] 165601
## --------------------------------------------------------
## redWine$quality: 8
## [1] 14881
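This output was meant to count the wines in each quality rating, but the values are not counts. They add up to 1,279,200, which is the sum of the row indices 1 through 1599, so the output is actually the sum of the index column X within each rating. A count per rating would need a function such as length() or table() instead.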
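A per-rating count can be sketched with `table()`. For contrast, the same toy data is also summed over its index column, which gives much larger numbers than a count would. The data frame `toy_wine` is made up for illustration.

```r
# Toy stand-in with an index column X, as in the real data
toy_wine <- data.frame(
  X       = 1:6,
  quality = c(5, 5, 6, 7, 5, 6)
)

# Number of wines in each quality rating
counts <- table(toy_wine$quality)
counts

# Sum of the row index within each rating: these are not counts,
# and together they always add up to sum(toy_wine$X)
index_sums <- tapply(toy_wine$X, toy_wine$quality, sum)
index_sums
```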
## redWine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## redWine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## redWine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## redWine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## redWine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## redWine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
This shows a numerical summary of the alcohol content grouped by quality rating. The median rises from about 9.9% at rating 3 to 12.15% at rating 8, although it dips to 9.7% at rating 5.
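Grouped summaries in this `redWine$quality: ...` format are what base R's `by()` prints. A minimal sketch, assuming a data frame with `alcohol` and `quality` columns; `toy_wine` and its values are made up for illustration.

```r
# Toy stand-in for the alcohol and quality columns (illustrative values)
toy_wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.0, 11.5, 12.1, 12.5),
  quality = c(5, 5, 5, 6, 6, 7, 7, 7)
)

# Six-number summary of alcohol within each quality rating
alcohol_by_quality <- by(toy_wine$alcohol, toy_wine$quality, summary)
alcohol_by_quality

# Just the medians, which are easier to compare across ratings
medians <- tapply(toy_wine$alcohol, toy_wine$quality, median)
medians
```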
The dataset contains 1599 wines, each described by 11 numeric characteristics, a quality rating, and an index column X. So far I have explored it through numerical summaries and charts that look at one variable at a time.
The main interest in my dataset is the relationship between certain characteristics of the wine and its quality rating.
The overall numerical analysis can help support my investigation into the relationship between characteristics of wine and its quality.
For this dataset, I did not create any new variables.
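Through my analysis, I did not notice any unusual distributions, although the summary does show long right tails for residual sugar (a maximum of 15.5 against a third quartile of 2.6) and chlorides (a maximum of 0.611 against a third quartile of 0.09). Otherwise, this was a clean and tidy dataset of wines.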
This plot explores the alcohol content within each quality rating. The overall median alcohol content is just above 10% (10.2%), and the per-rating summaries above show the median climbing to 11.5% and 12.15% for ratings 7 and 8.
This plot shows the relationship between pH levels and the density of the wine. We can see that the density trends downward as pH levels rise.
I created two new datasets using dplyr’s group_by() function, one grouped by quality rating and one grouped by alcohol content.
Here, we can see the average alcohol and pH content for each quality rating, along with the number of wines in each category.
## # A tibble: 6 x 7
## alcohol pH_mean sugar_mean density_mean sulphate_mean quality_mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.40 3.010000 1.95 1.0001000 0.7100000 4.5
## 2 8.50 3.150000 1.60 0.9991400 0.6500000 5.0
## 3 8.70 3.330000 2.10 0.9977500 0.8100000 6.0
## 4 8.80 3.160000 13.80 1.0024200 0.7500000 5.0
## 5 9.00 3.287667 3.06 0.9984173 0.6056667 5.4
## 6 9.05 3.390000 1.90 0.9958500 0.4300000 4.0
## # ... with 1 more variables: n <int>
The output above shows the averages of several wine characteristics, grouped by alcohol content. We then analyze one variable, sugar, to see if there is a relationship between the average sugar content and alcohol content. As you can see, there doesn’t seem to be a direct relationship.
However, there does seem to be a relationship between the average density of the wine and its alcohol content.
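The grouping step could look roughly like this with dplyr. The summary column names (`pH_mean`, `sugar_mean`, `quality_mean`, `n`) follow the tibble printed above; the data frame `toy_wine` and its values are made up for illustration, and the original code may differ in its details.

```r
library(dplyr)

# Toy stand-in for the red wine data (illustrative values only)
toy_wine <- data.frame(
  alcohol        = c(9.0, 9.0, 9.5, 9.5, 10.0),
  pH             = c(3.2, 3.4, 3.3, 3.5, 3.1),
  residual.sugar = c(2.0, 4.0, 1.8, 2.2, 2.5),
  quality        = c(5, 6, 5, 5, 7)
)

# One row per distinct alcohol value, with group means and group size
by_alcohol <- toy_wine %>%
  group_by(alcohol) %>%
  summarise(
    pH_mean      = mean(pH),
    sugar_mean   = mean(residual.sugar),
    quality_mean = mean(quality),
    n            = n()
  )

by_alcohol
```

Keeping `n` in the result matters when reading the averages: a group with a single wine can produce an extreme mean, such as the sugar_mean of 13.80 at 8.80% alcohol in the table above.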
##
## Pearson's product-moment correlation
##
## data: redWine$fixed.acidity and redWine$volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
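This was a test to see if there is a correlation between the fixed acidity and the volatile acidity using the Pearson method. A common rule of thumb treats a coefficient above 0.3 or below -0.3 as a meaningful correlation. The result for this test is -0.256, which falls just short of that, so the relationship is weak and negative. The very small p-value only tells us that the correlation is statistically distinguishable from zero, not that it is strong.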
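A minimal sketch of this kind of test, separating the size of the correlation from its statistical significance. The data frame `toy_wine` and its values are made up for illustration, the 0.3 cut-off is the rule of thumb quoted above, and the 0.05 significance level is a conventional choice of mine rather than something stated in the report.

```r
# Toy stand-in for the two acidity columns (illustrative values only)
toy_wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2, 7.9, 6.7, 8.5, 10.1, 5.6),
  volatile.acidity = c(0.70, 0.88, 0.28, 0.60, 0.58, 0.49, 0.31, 0.85)
)

# Pearson correlation test between the two variables
result <- cor.test(toy_wine$fixed.acidity, toy_wine$volatile.acidity,
                   method = "pearson")

r <- unname(result$estimate)  # correlation coefficient
p <- result$p.value           # p-value for H0: true correlation is 0

# Effect size check (rule of thumb) and significance check are separate
meets_rule_of_thumb <- abs(r) >= 0.3
is_significant      <- p < 0.05

c(r = r, p = p)
```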
##
## Pearson's product-moment correlation
##
## data: redWine$alcohol and redWine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
This test supports the previous analysis: quality and alcohol content are moderately and positively correlated (about 0.476), and the p-value shows that the correlation is statistically significant.
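As a check on the two outputs above, the t statistic that `cor.test()` reports for a Pearson correlation can be recomputed from the coefficient r and the sample size n, where n - 2 = 1597 is the df value shown in the output. The coefficients are rounded to four decimals here, so the results are approximate.

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad
t_{\text{alcohol, quality}} = \frac{0.4762\sqrt{1597}}{\sqrt{1-0.4762^{2}}} \approx 21.64, \qquad
t_{\text{fixed, volatile}} = \frac{-0.2561\sqrt{1597}}{\sqrt{1-0.2561^{2}}} \approx -10.59
$$

Both values match the reported t statistics. With about 1600 observations, even the weak correlation of -0.256 produces a very large t, which is why both p-values are tiny.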
I started off looking for correlations between the quality of the wine and its characteristics. I initially thought that alcohol content might have a relationship with the quality rating. I also wanted to explore whether any other characteristics of wine were directly correlated with, or affected by, one another.
One characteristic I thought was interesting was the overall increase in the alcohol content in relation to the quality rating. While I did imagine a higher alcohol content would result in people enjoying the wine more, I thought that it would have a limit, or not be as strongly correlated as it showed.
The strongest relationship that I discovered, both through a plot and through the cor.test() function, was between alcohol content and quality rating (a correlation of about 0.476). The correlation between fixed and volatile acidity was weaker, at about -0.256.
This plot explores the alcohol content in wine compared to its pH level, split by quality rating. Keeping in mind that each dot represents 5 wines with those characteristics, we can see where the majority of the wines are.
This plot shows that the amount of sulphates stays in a fairly consistent range of 0.5 to 1 as the alcohol content increases. It also shows lighter blues toward the higher alcohol content, which indicate higher quality.
Here, we explore the relationship between acidity and pH. We can see that there is a higher percentage of volatile acidity as the pH increases (which means the wine is less acidic). From this, we can infer that fixed acidity is more closely associated with lower pH levels than volatile acidity is.
I wanted to see if there is any relationship between any of the characteristics of wine and its quality rating, but initially proposed the idea that there is no one direct relationship. Apart from alcohol content, my additional analysis suggests that no single variable is a reliable predictor of wine quality.
It was interesting to see volatile acidity and its relationship with pH levels. The word “volatile” suggests movement and action, which are characteristics I also associate with acidity. So it was interesting to me to see that higher volatile acidity was actually associated with a more basic wine.
I chose this plot that compares alcohol content to its respective pH levels, separated by the wine’s quality rating. While we do not see any direct relationship, we can see that the majority of wines have an alcohol content of around 10%, with a pH level of between 3.0 and 3.5.
I chose this plot because the first question I had was “does the alcohol content have anything to do with its rating?” In my experience, people can judge different wines with a heavy bias toward alcohol content. When I saw that the median was just above 10%, I understood that alcohol was not weighed as heavily as I thought, even though most of the wine is rated a 5 or 6.
I chose this plot because it combines everything I have learned so far to show the great detail of the amount of acidity in relation to the wine’s pH levels.
One struggle in exploring this dataset was finding a meaningful relationship between the variables that I could attribute to the overall quality of the wine. What did go well was casting doubt on the idea that a single factor can cause the quality of the wine to go up or down. It was surprising to see characteristics of the wine that you would not expect to be linked show a dependent relationship. Moving forward, additional work can be done with datasets like this, such as including more characteristics of the wine. This would allow for a deeper search into what makes a particular wine rate higher than another.